The COVID-19 pandemic, caused by the novel coronavirus SARS-CoV-2, has had an unprecedented impact on global health, economies, and daily life since its emergence in late 2019. As the world grapples with the challenges posed by this highly contagious virus, epidemiological data have been continuously gathered and released to the public, driving numerous studies that try to understand its patterns of transmission, identify vulnerable populations, and inform public health strategies. Because of the severity of the early stage of the pandemic and its wide impact on global production, high-quality and accurate data were gathered nationally through surveys and reports, so we believed that the COVID-19 data sets could be more informative and extensive than other epidemiological data.
In this assignment, we looked into the COVID-19 epidemiology data sets provided by Statistics Canada along with other related data sets. We attempted to answer three major questions in three subsections:
First, we wanted to determine whether there was a relationship between the COVID-19 pandemic and the death counts for 2020, 2021, 2022 and 2023. This question offers insight into whether the virus has had a harmful impact on overall public health.
Second, we gathered data on long-term COVID-19 symptoms among Canadian adults. We wanted to draw conclusions on whether the virus had any impact on the long-term health of Canadians.
Third, we wanted to measure the relationship between the prevalence of COVID-19 and factors such as vaccination status, chronic conditions, and direct contact with people. Building a statistical model relating the response to these predictors helped us understand which measures or conditions can affect the prevalence of COVID-19.
We used two data sets to explore the relationship between COVID-19 and mortality in Canada. The first data set, published by the Government of Canada (Government of Canada 2023a), focuses on COVID-19 cases and deaths. It reports the numbers of new infections and deaths in Canada, is updated every Monday morning, and covers February 1, 2020 to October 28, 2023.
This data set contains 2940 observations of 23 variables, including the total number of COVID-19 infections and deaths and their rates from January 2020 until the end of the reporting week, as well as the weekly and bi-weekly numbers of infections and deaths and their rates. It also includes the average daily death counts and rates derived from both the weekly and bi-weekly data. In this section, our analysis focuses on the weekly and cumulative variables. The data dictionary for the selected variables is provided below.
Table 2.1.1: COVID-19 Cases and Death Data Dictionary

| Variables | Type | Example | Number.Unique | PctMissing | Comment |
|---|---|---|---|---|---|
| prname | character | British Columbia, Alberta | 15 | 0% | English name of jurisdiction (province, territory, Canada) |
| date | character | 2020-02-01, 2020-02-08 | 196 | 0% | Last day of the epidemiological week that the data represent. Epidemiological weeks run from Sunday to Saturday, so this date always falls on a Saturday. |
| reporting_year | integer | 2020, 2021 | 4 | 0% | The calendar year associated with the epidemiological week (based on the FluWatch weeks calendar) in which the data were reported (2020-2023). |
| totalcases | integer | 1, 0 | 2147 | 0% | The total number of cases reported from January 2020 until the end of the reporting week in a jurisdiction. |
| numtotal_last7 | numeric | 1, 0 | 1407 | 9.42% | Total number of cases during the reporting week for a jurisdiction, minus the total number of cases from that jurisdiction's previous week's update. |
| numdeaths | integer | 0, 1 | 1430 | 0% | The total number of deaths reported from January 2020 until the end of the reporting week in a jurisdiction. |
| numdeaths_last7 | numeric | 0, 1 | 295 | 11.02% | Total number of deaths for a jurisdiction, minus the total number of deaths from that jurisdiction's previous week's update. |
From Table 2.1.1, we found that about 10% of the weekly case and death counts are missing, which is problematic for our analysis. The missing values occur mainly in the northern and southeastern provinces and territories, such as Nunavut and Nova Scotia. To avoid the impact of these missing values, the following discussion uses the total COVID-19 deaths and infections for Canada as a whole instead of those for each province and territory.
The second data set contains the provisional weekly death counts by age and sex from 2010 to 2023, published by Statistics Canada. It records 149730 observations of 17 variables that are relevant for monitoring mortality in every province and territory in Canada. We removed variables that are irrelevant to our study or carry no useful information. For example, STATUS and TERMINATED are missing for all observations, and DECIMALS and UOM_ID take the same value for all observations. The data dictionary for the remaining variables is provided below.
Table 2.1.2: Weekly Mortality Data Dictionary

| Variables | Type | Example | Number.Unique | PctMissing | Comment |
|---|---|---|---|---|---|
| REF_DATE | character | 2010-01-09, 2010-01-16 | 713 | 0% | Reference period for the series being released (2010-2023). |
| GEO | character | Canada, place of occurrence, Newfoundland and Labrador, place of occurrence | 14 | 0% | Name of dimension. There can be up to 10 dimensions in a data table. (i.e. Geography) |
| Age.at.time.of.death | character | Age at time of death, all ages, Age at time of death, 0 to 44 years | 5 | 0% | Age group when death occurred |
| Sex | character | Both sexes, Males | 3 | 0% | Sex |
| Characteristics | character | Number of deaths | 1 | 0% | Number of deaths |
| UOM | character | Number | 1 | 0% | The unit of measure applied to a member given in text. |
| VALUE | integer | 4955, 2535 | 1091 | 9.25% | Total number of deaths for the given characteristics |
In the raw data set, 9.25% of the total death counts are missing, and all data after July 15, 2023 are missing. Because we only use the Canada-level data for all age groups and both sexes, missing data account for less than 1% of the filtered data set. Therefore, our study focused on the overall death counts and the number of COVID-19 deaths in Canada from January 2020 to July 2023.
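The filtering step described above can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the column names and category labels come from Table 2.1.2, while the rows and their values are made up, and the helper names `national_all_ages` and `pct_missing` are hypothetical.

```python
# Hypothetical rows that mimic the columns of Table 2.1.2; the values are made up.
ROWS = [
    {"REF_DATE": "2020-01-04", "GEO": "Canada, place of occurrence",
     "Age.at.time.of.death": "Age at time of death, all ages",
     "Sex": "Both sexes", "VALUE": 5900},
    {"REF_DATE": "2020-01-11", "GEO": "Canada, place of occurrence",
     "Age.at.time.of.death": "Age at time of death, all ages",
     "Sex": "Both sexes", "VALUE": 6000},
    {"REF_DATE": "2020-01-04", "GEO": "Canada, place of occurrence",
     "Age.at.time.of.death": "Age at time of death, all ages",
     "Sex": "Males", "VALUE": 3000},
    {"REF_DATE": "2020-01-04", "GEO": "Canada, place of occurrence",
     "Age.at.time.of.death": "Age at time of death, 0 to 44 years",
     "Sex": "Both sexes", "VALUE": None},
    {"REF_DATE": "2020-01-04",
     "GEO": "Newfoundland and Labrador, place of occurrence",
     "Age.at.time.of.death": "Age at time of death, all ages",
     "Sex": "Both sexes", "VALUE": None},
]


def national_all_ages(rows):
    """Keep Canada-level rows for all ages and both sexes."""
    return [
        r for r in rows
        if r["GEO"] == "Canada, place of occurrence"
        and r["Age.at.time.of.death"] == "Age at time of death, all ages"
        and r["Sex"] == "Both sexes"
    ]


def pct_missing(rows, column="VALUE"):
    """Percentage of rows whose `column` is missing (None)."""
    if not rows:
        return 0.0
    return 100.0 * sum(r[column] is None for r in rows) / len(rows)


subset = national_all_ages(ROWS)
print(f"raw missing: {pct_missing(ROWS):.1f}%")
print(f"filtered rows: {len(subset)}, missing: {pct_missing(subset):.1f}%")
```

Comparing the missing percentage before and after filtering mirrors the check reported above, where the share of missing values drops once only the national, all-ages, both-sexes series is kept.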
To better understand mortality in Canada, we visualized the weekly death counts for each year from 2010 to 2023 in Figure 2.1.2. The number of annual deaths clearly increases every year. The overall within-year trend from 2010 to 2019 is similar: a general decrease from the beginning to the middle of the year, followed by an upward trend until the year end. In the middle of 2020 and at the beginning of 2022, there are two significant spikes in the figure. These pronounced increases in death counts raise the possibility that they are attributable to distinct outbreaks of the epidemic.
To check this conjecture, we plotted the weekly number of deaths excluding COVID-19 deaths in Figure 2.1.3. The spikes in 2020 and 2022 disappear, but the small spike in mid-2021 remains. The rapid increases in death counts in 2020 and 2022 may therefore have been caused by COVID-19. In the following section, we discuss the probability that a death was due to COVID-19, conditional on year.
We used the Canadian COVID-19 Antibody and Health Survey (CCAHS) Cycle 1 microdata to model the prevalence. The CCAHS collects key information relevant to the pandemic to learn as much as possible about the virus, how it affects overall health, how it spreads, and whether Canadians are developing antibodies against it (2021). The survey contained two parts: an electronic questionnaire and an at-home blood test. The questionnaire captured the general health and exposure conditions of participants, whereas the blood test was used to determine the presence of COVID-19 antibodies.
The survey was cross-sectional and was given to individuals over 1 year old, excluding the population in remote areas of Canada. The data were sampled randomly from 30 strata created from the provinces. Because the population size varies across strata, Statistics Canada adjusted the sample size in strata with a larger population and a higher proportion of confirmed COVID-19 cases to ensure a precise estimate of the prevalence. In addition, two-stage sampling was carried out at the household level, in which one household member was selected for the survey. In total, 47900 people were sampled and about 23.0% completed the survey.
The resulting data contained 10978 responses and 99 variables. Because of the large number of variables, we selected only those we were most interested in, namely the ones we believed, before looking into the data, were most likely to be significant in modeling the prevalence. For instance, a variable showing whether the respondent had a family doctor is less likely to affect the prevalence than a variable showing the vaccination status. One must note, however, that some predictors could affect the response variable indirectly. An example is the variable recording the response to the question: “What are the reasons you would not get the COVID-19 vaccine? - Do not consider it necessary to get the vaccine”. This variable might influence the prevalence because the respondent received no vaccine. However, we considered it less informative because the information was already reflected in the vaccination status. Therefore, we chose only those variables that can have a direct impact on the prevalence. Moreover, variables could contain invalid categories such as “Valid skip” or “Not stated”. These categories exist because of regulatory and legal requirements and because the survey is entirely voluntary, so we treated them as missing data. Any variable with a high percentage of missing values (>25%) was dropped.
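The recoding and screening rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the report: the records are made up, `OtherQuestion` is a hypothetical variable name, and the `exempt` argument is an assumption added so that a response variable can be kept regardless of its missing share.

```python
INVALID_CATEGORIES = {"Valid skip", "Not stated"}
MAX_MISSING_SHARE = 0.25  # the >25% threshold stated in the text


def recode_invalid(records):
    """Replace invalid survey categories with None (missing)."""
    return [
        {var: (None if value in INVALID_CATEGORIES else value)
         for var, value in record.items()}
        for record in records
    ]


def missing_share(records, var):
    """Share of records in which `var` is missing."""
    return sum(r[var] is None for r in records) / len(records)


def variables_to_keep(records, exempt=()):
    """Variables whose missing share does not exceed the threshold.

    `exempt` lists variables kept regardless (an assumption of this sketch).
    """
    return [
        var for var in records[0]
        if var in exempt or missing_share(records, var) <= MAX_MISSING_SHARE
    ]


# Made-up responses; "OtherQuestion" is a hypothetical variable name.
RESPONSES = [
    {"VaccineStatus": "Yes", "Smoke": "No", "OtherQuestion": "Valid skip"},
    {"VaccineStatus": "No", "Smoke": "Not stated", "OtherQuestion": "Not stated"},
    {"VaccineStatus": "Yes", "Smoke": "Yes", "OtherQuestion": "Yes"},
    {"VaccineStatus": "Yes", "Smoke": "No", "OtherQuestion": "Valid skip"},
    {"VaccineStatus": "No", "Smoke": "No", "OtherQuestion": "No"},
]

clean = recode_invalid(RESPONSES)
kept = variables_to_keep(clean)
print(kept)
```

In this toy example `Smoke` has one missing value out of five and is kept, while `OtherQuestion` has three out of five and is dropped.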
Table 2.3.1 below gives the definitions of the selected variables.
Table 2.3.1: COVID Status Data Definition

| Variables | Type | Example | Number.Unique | PctMissing | Comment |
|---|---|---|---|---|---|
| Covid_Status | factor | NA, No | 3 | 70.31% | Had the respondent ever had a positive test result? |
| chronic | factor | No, Yes | 3 | 3.1% | Had the respondent reported having chronic condition? |
| DirectContact | factor | Yes, NA | 3 | 20.44% | In the last six months, had the respondent worked in direct contact with people? |
| Smoke | factor | No, NA | 3 | 19.12% | Does the respondent currently smoke tobacco? |
| WashHand | factor | Always, Often | 5 | 0.24% | Wash hands often? |
| WearMask | factor | Always, NA | 5 | 0.6% | Wear a mask in indoor public spaces where physical distancing is difficult or a mandatory mask by-law exists? |
| Keep2m | factor | Often, Always | 5 | 0.4% | Keep a 2 meter or 6 foot distance from others? |
| AvoidCrowds | factor | Often, Always | 5 | 0.9% | Avoid crowds and large gatherings? |
| FluVac | factor | Yes, No | 3 | 0.09% | In the past 12 months, have you had a seasonal flu vaccine? |
| VaccineStatus | factor | No, NA | 3 | 0.17% | Received at least one vaccine dose against COVID-19? |
| Sex | factor | 2, 1 | 3 | 0.12% | Sex: 1 - Male, 2 - Female |
| Age | factor | 3, 1 | 5 | 0.01% | Age group: 1-19, 20-39, 40-59, 60 and older |
| NumHouse | factor | 3, 4 | 5 | 0.77% | Number of people living in household: 1, 2, 3, and 4 or more |
| AntiBodyResult | factor | Negative, Indeterminate | 3 | 0% | The overall interpretation of the laboratory result is that if 0 of 3 antigen tests was positive, the respondent had an overall negative test for antibodies against SARS-CoV-2, if 1 of 3 antigen tests was positive, the respondent had an overall indeterminate test for antibodies against SARS-CoV-2, and if 2 or more of 3 antigen tests were positive, the respondent had an overall positive test for antibodies against SARS-CoV-2. |
To fully understand the relationship between the response variable Covid_Status and the other predictors, we fitted logistic regression models in Section 3.3 and provided additional inference.
To discuss the probability that a death was due to COVID-19, we first calculated the proportion of COVID-19 deaths among all deaths for each year from 2020 to 2023 in Table 3.1.1. To our surprise, the proportion in 2022 is higher than the proportion in 2020 (0.0574 and 0.0490, respectively). This might be due to the outbreak of the new Omicron variant. The relatively low proportions in 2021 and 2023 might be due to the growing vaccinated population.
Table 3.1.1: Contingency table for proportion of COVID-19 death

| Year | Covid Death | Not Covid Death |
|---|---|---|
| 2020 | 0.0490 | 0.9510 |
| 2021 | 0.0463 | 0.9537 |
| 2022 | 0.0574 | 0.9426 |
| 2023 | 0.0231 | 0.9769 |
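To test whether the probability of COVID-19 death is homogeneous across years, we can use the Chi-squared test. With \(P_{j|i}\) denoting the probability of death category \(j\) given year \(i\) and \(P_{\cdot j}\) the marginal probability of category \(j\), the null and alternative hypotheses of homogeneity are: \[\begin{gather*} H_0: P_{j|i} = P_{\cdot j} \ \text{for all } i, j\\ H_1: P_{j|i} \neq P_{\cdot j} \ \text{for at least one pair } (i, j) \end{gather*}\]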
Table 3.1.3: Results of the homogeneity tests between COVID-19 death and year

| Test | Chi-Squared Statistic | P-Value |
|---|---|---|
| Chi-squared test | 3107.691 | < 0.05 |
| Likelihood ratio test | 3538.714 | < 0.05 |
The statistics computed by the Chi-squared test and the likelihood ratio test differ, but the p-value is less than 0.05 in both tests. Thus we reject the null hypothesis at the 0.05 level: there is strong evidence that the probability of COVID-19 death differs significantly across years.
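For readers who want to see how the two statistics are formed, here is a minimal sketch in plain Python. The counts are hypothetical: they are the Table 3.1.1 proportions scaled to 10,000 deaths per year, because the actual yearly death counts are not listed in this report, so the resulting statistics will not match Table 3.1.3. The function names are illustrative, and the decision uses the standard 0.95 critical value of the chi-squared distribution with 3 degrees of freedom (7.815).

```python
import math

# HYPOTHETICAL counts: Table 3.1.1 proportions scaled to 10,000 deaths per year.
# Rows are 2020-2023; columns are (COVID-19 deaths, other deaths).
COUNTS = [
    [490, 9510],
    [463, 9537],
    [574, 9426],
    [231, 9769],
]

CRITICAL_95_DF3 = 7.815  # 0.95 quantile of the chi-squared distribution, 3 df


def expected_counts(table):
    """Expected counts under homogeneity: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[rt * ct / total for ct in col_totals] for rt in row_totals]


def pearson_chi2(table):
    """Pearson statistic: sum of (observed - expected)^2 / expected."""
    exp = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, exp)
        for o, e in zip(obs_row, exp_row)
    )


def likelihood_ratio_g2(table):
    """Likelihood ratio statistic: 2 * sum of observed * ln(observed / expected)."""
    exp = expected_counts(table)
    return 2.0 * sum(
        o * math.log(o / e)
        for obs_row, exp_row in zip(table, exp)
        for o, e in zip(obs_row, exp_row)
        if o > 0  # cells with zero count contribute nothing
    )


def degrees_of_freedom(table):
    """(rows - 1) * (columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)


chi2 = pearson_chi2(COUNTS)
g2 = likelihood_ratio_g2(COUNTS)
print(f"df = {degrees_of_freedom(COUNTS)}")
print(f"Pearson chi-squared = {chi2:.2f}, reject H0: {chi2 > CRITICAL_95_DF3}")
print(f"Likelihood ratio G2 = {g2:.2f}, reject H0: {g2 > CRITICAL_95_DF3}")
```

The two statistics use the same expected counts but measure the discrepancy differently, which is why they differ numerically while usually leading to the same decision in large samples.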
We can then compute the relative risk and odds ratio for each year to measure the association between year and the COVID-19 death proportion. We chose the COVID-19 death proportion in 2020 as the baseline category.
Table 3.1.4: Relative risks for three years

| Year | 2021 | 2022 | 2023 |
|---|---|---|---|
| Relative risk | 0.9449 | 1.1714 | 0.4714 |
Table 3.1.4 shows the relative risks for 2021, 2022 and 2023. The relative risks for 2021 and 2023 are less than 1, so a death in 2021 or 2023 was less likely to be a COVID-19 death than a death in 2020. The relative risk for 2022 is greater than 1, so a death in 2022 was more likely to be a COVID-19 death than a death in 2020.
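The relative risks in Table 3.1.4 can be reproduced from the proportions in Table 3.1.1. The sketch below assumes only those published proportions; the function name `relative_risk` is illustrative.

```python
# Proportion of deaths attributed to COVID-19, copied from Table 3.1.1.
COVID_SHARE = {2020: 0.0490, 2021: 0.0463, 2022: 0.0574, 2023: 0.0231}
BASELINE_YEAR = 2020


def relative_risk(p_year, p_baseline):
    """Risk in the given year divided by risk in the baseline year."""
    return p_year / p_baseline


rr = {
    year: relative_risk(p, COVID_SHARE[BASELINE_YEAR])
    for year, p in COVID_SHARE.items()
    if year != BASELINE_YEAR
}

for year, value in rr.items():
    direction = "lower" if value < 1 else "higher"
    print(f"{year}: RR = {value:.4f} ({direction} than {BASELINE_YEAR})")
```

Because the baseline is in the denominator, a value below 1 means that the COVID-19 share of deaths in that year was smaller than in 2020.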
Table 3.1.5: Odds ratio for three years

| Year | 2021 | 2022 | 2023 |
|---|---|---|---|
| Odds Ratio | 1.0613 | 0.8461 | 2.179 |
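From Table 3.1.5, we can see that none of the three odds ratios equals 1, which indicates an association between year and the COVID-19 death proportion. The values are consistent with Table 3.1.1 when each odds ratio is taken as the odds of COVID-19 death in 2020 divided by the odds in the given year, which is the reciprocal of the direction used for the relative risks. The odds ratios above 1 for 2021 and 2023 therefore mean that the odds of a death being due to COVID-19 were higher in 2020 than in those years, while the odds ratio below 1 for 2022 means that the odds were higher in 2022 than in 2020.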
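The direction of an odds ratio matters for its interpretation, so it can help to compute it both ways from the proportions in Table 3.1.1. In the sketch below, the variable and function names are illustrative; only the proportions come from the report.

```python
# Proportion of deaths attributed to COVID-19, copied from Table 3.1.1.
COVID_SHARE = {2020: 0.0490, 2021: 0.0463, 2022: 0.0574, 2023: 0.0231}
BASELINE_YEAR = 2020


def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


baseline_odds = odds(COVID_SHARE[BASELINE_YEAR])

# Odds in 2020 divided by odds in each later year.
or_2020_vs_year = {
    year: baseline_odds / odds(p)
    for year, p in COVID_SHARE.items()
    if year != BASELINE_YEAR
}

# Odds in each later year divided by odds in 2020 (the reciprocal).
or_year_vs_2020 = {year: 1.0 / value for year, value in or_2020_vs_year.items()}

for year in or_2020_vs_year:
    print(
        f"{year}: 2020 vs year = {or_2020_vs_year[year]:.4f}, "
        f"year vs 2020 = {or_year_vs_2020[year]:.4f}"
    )
```

The first direction matches the values printed in Table 3.1.5, while the second direction is the one that lines up with the relative risks, since both then place 2020 in the denominator.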
The hypotheses we were most interested in were:
\[\begin{gather*} H_0: \beta_j = 0 \\ H_1: \beta_j \neq 0 \end{gather*}\]
for each coefficient and its corresponding covariate. In other words, we wanted to describe the relationship between the COVID status of a participant and the predictors. We selected the model with the lowest AIC value, shown in Table 3.3.1.
The coefficients of our best model are shown in summary Table 3.3.2. The variable FluVac, which indicates whether the participant had a flu shot in the past 12 months, had a p-value of 0.155, which is not significant at the 0.05 level, so we could not conclude whether the variable was associated with the response.
We looked at the AIC when the flu vaccine status variable was removed. One can see in Table 3.3.3 below that there was not much difference in the AIC value. For model simplicity we therefore omitted the variable.
Similarly, from summary Table 3.3.2 above, the variable DirectContact, which indicates whether the participant had direct contact with people, also had a non-significant p-value of 0.053. After first removing the FluVac variable, we compared the models with and without the DirectContact variable using the deviance. From summary Table 3.3.4 below, the p-value of the Chi-squared statistic was 0.085, again suggesting that we could not conclude whether there was an association between the COVID status and having direct contact with people.
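The deviance comparison of two nested logistic models can be sketched as follows. The deviances here are hypothetical and are not taken from Table 3.3.4; they were chosen only so that the p-value lands near the reported 0.085. The sketch also assumes that dropping DirectContact removes a single coefficient, so the test has 1 degree of freedom, for which the chi-squared tail probability has a closed form through the complementary error function.

```python
import math


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    if x < 0:
        raise ValueError("the test statistic must be non-negative")
    return math.erfc(math.sqrt(x / 2.0))


def deviance_test_df1(deviance_reduced, deviance_full):
    """Difference in deviance between nested models and its p-value (1 df)."""
    statistic = deviance_reduced - deviance_full
    return statistic, chi2_sf_df1(statistic)


# HYPOTHETICAL deviances, not the values behind Table 3.3.4.
statistic, p_value = deviance_test_df1(deviance_reduced=1002.97, deviance_full=1000.00)
print(f"deviance difference = {statistic:.2f}, p-value = {p_value:.3f}")
print("keep the extra variable" if p_value < 0.05 else "no evidence the extra variable is needed")
```

A p-value above 0.05 means the larger model does not fit significantly better than the smaller one, which is the situation described for DirectContact.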
We again dropped the variable DirectContact because the AIC didn’t change significantly after removal:
Therefore, based on summary Table 3.3.6, we concluded that our model was: \[\begin{equation*} \ln\left(\frac{p_i}{1-p_i}\right) = -5.565 - 4.568 \times \text{VaccineStatusYes}_i + 3.392 \times \text{AntiBodyResultIndeterminate}_i + 6.602 \times \text{AntiBodyResultPositive}_i \end{equation*}\]
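As an illustration, the fitted equation can be turned into predicted probabilities by applying the inverse logit. The coefficients below are copied from the equation above; the function name `predicted_probability` is illustrative, and the sketch assumes that the reference levels are an unvaccinated respondent with a negative antibody result, as the dummy variable names suggest.

```python
import math

# Coefficients copied from the fitted model equation.
COEF = {
    "Intercept": -5.565,
    "VaccineStatusYes": -4.568,
    "AntiBodyResultIndeterminate": 3.392,
    "AntiBodyResultPositive": 6.602,
}


def predicted_probability(vaccinated, antibody):
    """Probability of a positive COVID status for one covariate profile.

    `antibody` is "Negative" (assumed reference level), "Indeterminate" or "Positive".
    """
    eta = COEF["Intercept"]
    if vaccinated:
        eta += COEF["VaccineStatusYes"]
    if antibody == "Indeterminate":
        eta += COEF["AntiBodyResultIndeterminate"]
    elif antibody == "Positive":
        eta += COEF["AntiBodyResultPositive"]
    elif antibody != "Negative":
        raise ValueError(f"unknown antibody result: {antibody!r}")
    return 1.0 / (1.0 + math.exp(-eta))  # inverse logit


for vaccinated in (False, True):
    for antibody in ("Negative", "Indeterminate", "Positive"):
        p = predicted_probability(vaccinated, antibody)
        print(f"vaccinated={vaccinated!s:5} antibody={antibody:13} p={p:.4f}")
```

Evaluating all six profiles side by side shows how the negative vaccination coefficient and the positive antibody coefficients combine on the probability scale.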
We further fitted another model relating the COVID status to preventative behaviours such as washing hands, wearing masks, keeping a 2-metre distance and avoiding crowds. The result in summary Table 3.3.7 showed that only the “Often” and “Always” levels of washing hands were significantly and negatively related to the COVID status, whereas all other covariates were inconclusive.
From Section 3.1, we found that the probability of death caused by COVID-19 is not homogeneous across years, and we then computed the relative risks and odds ratios for 2021, 2022 and 2023 with 2020 as the baseline. The relative risk in 2022 (RR = 1.1714) indicates that a death was more likely to be due to COVID-19 than in 2020, while the relative risk in 2021 (RR = 0.9449) suggests a slightly lower likelihood. The relative risk in 2023 (RR = 0.4714) stands out as substantially lower. The odds ratios in Table 3.1.5 are consistent with Table 3.1.1 when read as the odds of COVID-19 death in 2020 relative to each year, so they tell the same story: the odds in 2020 were about twice those in 2023 (OR = 2.179), slightly higher than those in 2021 (OR = 1.0613), and lower than those in 2022 (OR = 0.8461).
The lower relative risks in 2021 and 2023 might be due to widespread vaccination in Canada. Public Health Ontario states (Ontario 2023) that over 70.2% of Ontario residents had received at least one dose of a COVID-19 vaccine during 2021, and over 60% had completed a two-dose vaccination. The vaccine also retained high effectiveness against the Alpha and Delta variants of concern. As of October 2023, 80.5% of the total population in Canada had completed their primary vaccination series, and over 4 million people had received a booster dose of the Pfizer-BioNTech Comirnaty vaccine (Government of Canada 2023b). As public health restrictions and mandatory masking policies were dropped, deaths also shifted from the young to the old, with more than 80% of deaths occurring in patients over 65 years old with comorbidities (Kulkarni 2022).
From Section 3.3, we found that the odds of a positive COVID-19 status were related to two covariates: the vaccination status and the presence of antibodies in the blood. Specifically, we interpreted each coefficient as the log odds ratio for its corresponding covariate.
\[ \ln{OR}=\beta_j, \quad OR:=\frac{p_2(1-p_1)}{p_1(1-p_2)} \]
In other words, \(e^{\beta_j}\) is the multiplicative change in the odds for a one-unit increase in the covariate, with all other covariates held constant. The constant coefficient \(\beta_0\), on the other hand, is interpreted as the log-odds \(\ln{(\frac{p_1}{1-p_1})}\) when all covariates are at their reference levels. From summary Table 3.3.6, the coefficient for the vaccine status was negative, indicating that the odds of having a positive COVID test decrease if a vaccine was given. This result was not surprising, since vaccines have so far helped humanity combat this virus. The coefficient for the indeterminate antibody result was positive, and the one for the positive antibody result was even higher.
This result needs careful interpretation. It means that the odds of a positive COVID test are positively associated with the result of the antibody test: the more positive the antibody test, the higher the odds of having had a positive COVID test. However, an antibody test and a COVID-19 diagnostic test are not the same thing, according to the explanations provided by the FDA (2023). The antibody test does not detect the virus. Rather, it merely tells whether a person may have had a prior infection, so it does not reflect whether the person is currently infected. In addition, an antibody test could show whether a person has been vaccinated, but in general it may not detect the kind of antibodies created by vaccines, so this depends on the type of antibody test performed. From our result, we were only able to say that there was a positive relationship between the COVID diagnostic test and the antibody test, which was not surprising because a positive antibody test typically reflects a prior infection. This information may be useful if, for example, one of the tests is more affordable and can be used as a preliminary screening method.
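To make the odds-ratio reading of the coefficients concrete, the following sketch exponentiates the fitted coefficients reported in Section 3.3. Only the coefficient values come from the report; the dictionary and function names are illustrative.

```python
import math

# Slope coefficients copied from the fitted model in Section 3.3.
SLOPES = {
    "VaccineStatusYes": -4.568,
    "AntiBodyResultIndeterminate": 3.392,
    "AntiBodyResultPositive": 6.602,
}


def odds_ratio(beta):
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(beta)


odds_ratios = {name: odds_ratio(beta) for name, beta in SLOPES.items()}

for name, value in odds_ratios.items():
    effect = "lowers" if value < 1 else "raises"
    print(f"{name}: OR = {value:.4g} ({effect} the odds relative to the reference level)")
```

An odds ratio below 1 corresponds to a negative coefficient and an odds ratio above 1 to a positive one, which matches the signs discussed above.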
From the second model, fitted for preventative behaviours, we found that only washing hands showed a negative effect on the odds of getting COVID. We were not able to draw any conclusion for the other preventative behaviours, but we thought it was inevitably hard to find a relationship between the COVID status and those behaviours because people may not answer the questionnaire accurately. People might find it difficult to distinguish between wearing a mask often and always. People might even report that they keep a distance of 2 metres or more when in reality they do not. The survey answers may therefore be less reliable. Thus we thought it is generally difficult to accurately describe the relationship between the prevalence of disease and preventative behaviour. Researchers would have to design experiments and find ways to quantify the behaviour in order to obtain more reliable outcomes.
In the Mortality section, we found a significant difference in the probability of death caused by COVID-19 across years. Relative to 2020, the years 2021 and 2023 have a relative risk less than 1, indicating that a death was less likely to be due to COVID-19 in these years, whereas 2022 has a relative risk greater than 1, indicating that a death was more likely to be due to COVID-19 in that year. The odds ratios in Table 3.1.5 (greater than 1 for 2021 and 2023, less than 1 for 2022) compare the odds in 2020 to the odds in each year and therefore lead to the same conclusion.
In the Prevalence Modeling section, we found that the COVID status is negatively associated with the vaccination status, indicating that vaccination was a significant factor in lowering the prevalence of the virus. We also found, unsurprisingly, that the antibody test result was positively related to the COVID status. In addition, we confirmed that washing hands is negatively associated with the prevalence of COVID, but we could not reach the same conclusion for the other preventative behaviours.